Sains Malaysiana 52(12)(2023): 3879-3892
http://doi.org/10.17576/jsm-2023-5212-19
An Efficient Method of Identification of Influential Observations in Multiple Linear Regression and Its Application to Real Data
(Kaedah yang Cekap bagi Pengecaman Cerapan Berpengaruh dalam Model Regresi Linear Berganda dan Kegunaannya dalam Set Data Sebenar)
HABSHAH MIDI1,*, HASAN TALIB HENDI1, HASSAN URAIBI2, JAYANTHI ARASAN3 & SHELAN SAIED ISMAEEL4
1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
2Department of Statistics, University of Al-Qadisiyah, Iraq
3Department of Mathematics & Statistics, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
4Department of Mathematics, Faculty of Science, University of Zakho, Iraq
Received: 20 June 2023/Accepted: 14 November 2023
Abstract
Influential observations (IOs) are observations that, either alone or together with several other observations, have a detrimental effect on the computed values of various estimates. It is therefore very important to detect their presence. Several methods have been proposed to identify IOs, including the fast improvised influential distance (FIID). The FIID method has been shown to be more efficient than some existing methods. Nonetheless, it has several shortcomings: it is computationally unstable, it still suffers from masking and swamping effects, it is time consuming, and it does not use a proper cut-off point. To address these problems, a new robust version of the influential distance method (RFIID), based on the Reweighted Fast Consistent and High Breakdown (RFCH) estimator, is proposed. The results of real data and a Monte Carlo simulation study indicate that the RFIID is able to correctly separate the IOs from the rest of the data with the shortest computational running time, the least swamping effect and no masking effect compared with the other methods in this study.
Keywords: Good leverage point; influential distance; influential observations; Reweighted Fast Consistent and High Breakdown (RFCH) estimator
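
To make the notion of an influential observation concrete, the following is a minimal sketch, not the RFIID or FIID method described in the abstract. It uses hypothetical data and the classical (non-robust) Cook's distance as a stand-in diagnostic, and shows how a single bad leverage point can pull an ordinary least squares slope of roughly 2 down to nearly 0. All function names, variable names and data values are assumptions made for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cooks_distance(X, y):
    """Classical Cook's distance per observation (a stand-in, not the RFIID statistic)."""
    n, p = X.shape
    beta = ols_fit(X, y)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
    h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)

# Hypothetical clean data: y = 1 + 2x with a small alternating perturbation
x = np.arange(1.0, 11.0)
noise = 0.1 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])
beta_clean = ols_fit(X, y)

# Append one hypothetical bad leverage point far from the bulk of the data
x_c = np.append(x, 30.0)
y_c = np.append(y, 5.0)
X_c = np.column_stack([np.ones_like(x_c), x_c])
beta_cont = ols_fit(X_c, y_c)

d = cooks_distance(X_c, y_c)
print("slope without the added point:", beta_clean[1])
print("slope with the added point:   ", beta_cont[1])
print("index of largest Cook's distance:", int(np.argmax(d)))
```

In this toy case the added point is easy to flag. With several such points, classical diagnostics of this kind can fail through masking and swamping, which is the problem that robust approaches such as the proposed RFIID aim to address.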
Abstrak
Influential observations (IOs) are defined as observations that, either alone or together with several other observations, have a detrimental effect on the computed values of various estimates. It is therefore very important to detect the presence of influential observations. Several methods have been proposed to identify IOs, including the fast improvised influential distance (FIID) method. The FIID method has been shown to be more efficient than existing methods. However, the FIID method has weaknesses: its computation is unstable, it still suffers from masking and swamping effects, it has long computation times, and it does not use a proper cut-off point. A new robust version of the influential distance method, based on the reweighted fast consistent and high breakdown estimator (RFIID), is proposed to overcome these problems. The results of real data and a Monte Carlo simulation study show that RFIID is able to separate the IOs from the rest of the data with the shortest computation time and the smallest swamping effect, with no masking effect, compared with the other methods in this study.
Keywords: Influential observations; influential distance; reweighted fast consistent and high breakdown estimator; good leverage point
REFERENCES
Belsley, D., Kuh, E. & Welsch, R. 2004. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Hoboken, New Jersey: John Wiley & Sons, Inc.
Chatterjee, S. & Hadi, A.S. 1986. Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1(3): 379-393.
Devlin, S.J., Gnanadesikan, R. & Kettenring, J.R. 1981. Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association 76(374): 354-362.
Gunst, R.F. & Mason, R.L. 1980. Regression Analysis and Its Application: A Data Oriented Approach. New York: Marcel Dekker.
Habshah, M. & Shabbak, A. 2011. Robust multivariate control charts to detect small shifts in mean. Mathematical Problems in Engineering 2011: 923463. doi: 10.1155/2011/923463
Habshah, M., Muhammad, S. & Ismaeel, S.S. 2021. Fast improvised influential distance for the identification of influential observations in multiple linear regression. Sains Malaysiana 50(7): 2085-2094.
Habshah, M., Talib, H., Jayanthi, A. & Uraibi, H.S. 2020. Fast and robust diagnostic technique for the detection of high leverage points. Journal of Science and Technology 28(4): 1203-1220.
Habshah, M., Norazan, M.R. & Rahmatullah Imon, A.H.M. 2009. The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression. Journal of Applied Statistics 36(5): 507-520.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. 2011. Robust Statistics: The Approach Based on Influence Functions. Hoboken, New Jersey: John Wiley & Sons, Inc.
Mohammed, A., Habshah, M. & Rahmatullah Imon, A.H.M. 2015. A new robust diagnostic plot for classifying good and bad high leverage points in a multiple linear regression model. Mathematical Problems in Engineering 2015: 279472. doi: 10.1155/2015/279472
Nurunnabi, A.A.M., Nasser, M. & Imon, A.H.M.R. 2016. Identification and classification of multiple outliers, high leverage points and influential observations in linear regression. Journal of Applied Statistics 43(3): 509-525.
Olive, D.J. & Hawkins, D.M. 2010. Robust Multivariate Location and Dispersion. Preprint, www.math.siu.edu/olive/preprints.htm
Olive, D.J. & Hawkins, D.M. 2008. High Breakdown Multivariate Estimators. https://www.researchgate.net/profile/David_Olive2/publication/240737720_High_Breakdown_Multivariate_Estimators/links/0a85e53234b7db7f90000000.pdf
Rahmatullah Imon, A.H.M. 2005. Identifying multiple influential observations in linear regression. Journal of Applied Statistics 32: 929-946.
Rahmatullah Imon, A.H.M. 2002. Identifying multiple high leverage points in linear regression. Journal of Statistical Studies 3: 207-218.
Rashid, A.M., Midi, H., Dhnn, W. & Arasan, J. 2021a. An efficient estimation and classification methods for high dimensional data using robust iteratively reweighted SIMPLS algorithm based on Nu-Support vector regression. IEEE Access 9: 45955-45967.
Rashid, A.M., Midi, H., Dhnn, W. & Arasan, J. 2021b. Detection of outliers in high-dimensional data using Nu-Support vector regression. Journal of Applied Statistics 49(10): 2550-2569.
Rousseeuw, P. & Leroy, A.M. 1987. Robust Regression and Outlier Detection. New York: Wiley Series in Probability and Mathematical Statistics.
Rousseeuw, P. & Yohai, V. 1984. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis. New York: Springer.
Welsch, R.E. 1980. Regression sensitivity analysis and bounded-influence estimation. In Evaluation of Econometric Models, edited by Kmenta, J. & Ramsey, J.B. Massachusetts: Academic Press. pp. 153-167.
Zahariah, S. & Midi, H. 2022. Minimum regularized covariance determinant and principal component analysis-based method for the identification of high leverage points in high dimensional sparse data. Journal of Applied Statistics 50(13): 2817-2835.
*Corresponding author; email: habshah@upm.edu.my